Abstract

Unsupervised dictionary learning has been a key com- 
ponent in state-of-the-art computer vision recognition ar- 
chitectures. While highly effective methods exist for patch- 
based dictionary learning, these methods may learn redun- 
dant features after the pooling stage in a given early vi- 
sion architecture. In this paper, we offer a novel dictionary 
learning scheme to efficiently take into account the invari- 
ance of learned features after the spatial pooling stage. The 
algorithm is built on simple clustering, and thus enjoys ef- 
ficiency and scalability. We discuss the underlying mecha- 
nism that justifies the use of clustering algorithms, and em- 
pirically show that the algorithm finds better dictionaries 
than patch-based methods with the same dictionary size. 

1. Introduction 

In the recent decade local patch-based, spatially pooled 
feature extraction pipelines have been shown to provide 
good image features for classification. Methods following 
such a pipeline usually start from densely extracted local 
image patches (either normalized raw pixel values or hand- 
crafted descriptors such as SIFT or HOG), and perform dic- 
tionary learning to obtain a dictionary of codes (filters). The 
patches are then encoded into an over-complete representa- 
tion using various algorithms such as sparse coding [i-r, ^ . ] 
and simple inner product with a non-linear post-processing 
[4, 10]. After encoding, spatial pooling with an average or max
operation is carried out to form a global image representation
[ , ]. The encoding and pooling pipeline can be
stacked to produce a final feature vector, which is then used 
to predict the labels for the images usually via a linear clas- 
sifier. 

There is an abundance of literature on single-layered networks
for unsupervised feature encoding. Dictionary learning
algorithms have been discussed to find a set of bases that
reconstruct local image patches or descriptors well [ , ],
and several encoding methods have been proposed to map
the original data to a high-dimensional space that emphasizes
certain properties, such as sparsity [14, 19, 20] or locality [1^].

A particularly interesting finding in recent papers
[3, 15, 4, 16] is that very simple patch-based dictionary
learning algorithms like K-means or random selection,
combined with feed-forward encoding methods with a naive
nonlinearity, produce state-of-the-art performance on various
datasets. Explanations of this phenomenon often focus
on local image patch statistics, such as the frequency
selectivity of random samples [ ].

A potential problem with such patch-based learning
methods is that they may learn redundant features when we
consider the pooling stage, as two codes that are uncorrelated
may become highly correlated after pooling due to the
introduction of spatial invariance. While using a larger dictionary
almost always alleviates this problem, in practice we
often want the dictionary to have a limited number of codes
for several reasons. First, feature computation has become
the dominant cost in state-of-the-art image classification
pipelines, even with purely feed-forward methods
(e.g., threshold encoding [ ]) or speedup algorithms (e.g.,
LLC [ ]). Second, a reasonably sized dictionary makes it
easier to learn further tasks that depend on the encoded
features; this is especially true when we have more than
one coding-pooling stage, as in stacked deep networks,
or when one applies more complex pooling stages such as
second-order pooling [ ], as a large encoding output would
immediately drive up the number of parameters in the next
layer.

Thus, it would be beneficial to design a dictionary learn- 
ing algorithm that takes pooling into consideration and 
learns a compact dictionary. Prior work addressing this
problem often resorts to convolutional approaches [12, 22].
These methods are usually able to find dictionaries that 
bear a better level of spatial invariance than patch-based K- 
means, but are often non-trivial to train when the dictionary 
size is large, since a convolution operation has to be carried 
out instead of simple inner products. 

In our paper we present a new method that is analogous 
to the patch-based K-means method for dictionary learning, 
but takes into account the redundancy that may be intro- 
duced in the pooling stage. We show how a K-centroids 
Figure 1: The feature extraction pipeline, composed of dense local patch extraction, encoding, and pooling. Illustrated is the
average pooling over the whole image for simplicity, and in practice the pooling can be carried out over finer grids of the 
image as well as with different operations (such as max). 



clustering method applied on the covariance between can- 
didate codes can efficiently learn pooling invariant repre- 
sentations. We also show how one can view dictionary 
learning as a matrix approximation problem, which finds 
the best approximation to an "oracle" dictionary (which is 
often very large). It turns out that under this perspective, 
the performance of various dictionary learning methods can 
be explained by the recent findings in Nystrom subsampling 
[11,23]. 

We will first review the feature extraction pipeline and 
the effect of pooling on the learned dictionary in Section 2, 
and then describe the proposed two-stage dictionary learning
method in Section 3. We then evaluate this simple
yet effective dictionary learning algorithm on the standard
CIFAR-10 and STL benchmark datasets, as well as on a
fine-grained classification task, in which we show that feature
learning plays an important role.

2. Background 

We illustrate the feature extraction pipeline that is composed
of encoding dense local patches and pooling encoded
features in Figure 1. Specifically, starting with an input image
I, we formally define the encoding and pooling stages
as follows.

(1) Coding. In the coding step, we extract local image
patches¹ and encode each patch to K activation values
based on a dictionary of size K (learned via a separate dictionary
learning step). These activations are typically binary
(in the case of vector quantization) or continuous (in the
case of, e.g., sparse coding), and it is generally believed that
having an over-complete (K > the dimension of patches)
dictionary while keeping the activations sparse helps classification,
especially when linear classifiers are used in the
later steps.

¹Although we use the term "patches" throughout the paper, the pipeline
works with local image descriptors, such as SIFT, as well.

We will mainly focus on what we call the decoupled
encoding methods, in which the activation of one code
does not rely on other codes, such as threshold encoding [ ],
which computes the inner product between a patch $\mathbf{p}$ and
each code $\mathbf{d}_k$, with a fixed threshold parameter $\alpha$:
$f_k(\mathbf{p}) = \max\{0, \mathbf{d}_k^\top \mathbf{p} - \alpha\}$.
Such methods have become increasingly
popular mainly for their efficiency over coupled encoding
methods such as sparse coding, for which a joint optimization
needs to be carried out. Their use in several
deep models (e.g. [ ]) also suggests that such a simple nonlinearity
may suffice to learn a good classifier in the later
stages.
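As a rough illustration of the decoupled threshold encoding described above, the following NumPy sketch encodes a batch of patches against a dictionary. The function name `threshold_encode` and the toy arrays are assumptions made for this example; the default value of `alpha` mirrors the setting reported later in the experiments.

```python
import numpy as np

def threshold_encode(patches, dictionary, alpha=0.25):
    """One-sided threshold encoding: activation k of a patch p is
    max(0, d_k . p - alpha).

    patches:    (N, D) array, one flattened patch per row.
    dictionary: (K, D) array, one code per row.
    Returns an (N, K) array of non-negative activations.
    """
    # Each activation depends only on its own code, so a single matrix
    # product followed by an element-wise threshold is sufficient.
    return np.maximum(0.0, patches @ dictionary.T - alpha)

# Toy data (hypothetical): 2 codes and 3 patches in a 2-D patch space.
codes = np.array([[1.0, 0.0], [0.0, 1.0]])
patches = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 0.1]])
activations = threshold_encode(patches, codes, alpha=0.25)
```

Because no joint optimization over codes is involved, the cost is one inner product per patch and code, which is the efficiency argument made in the text.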

Learning the dictionary: Recently, it has been found
that relatively simple dictionary learning and encoding approaches
lead to surprisingly good performance [3, 16].
For example, to learn a dictionary $\mathcal{D} = \{\mathbf{d}_1, \mathbf{d}_2, \dots, \mathbf{d}_K\}$
of size K from randomly sampled patches $\mathcal{P} = \{\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_N\}$,
each reshaped as a vector of pixel values,
one could simply adopt the K-means algorithm, which
aims to minimize the squared distance between each patch
and its nearest code: $\min_{\mathcal{D}} \sum_{i=1}^{N} \min_j \|\mathbf{p}_i - \mathbf{d}_j\|_2^2$. We
refer to [ ] for a detailed comparison of different dictionary
learning and encoding algorithms.
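The K-means objective above can be sketched with plain Lloyd iterations. This is a minimal illustration, not the normalized K-means variant used in the experiments; the function name, the iteration count, and the random toy data are assumptions.

```python
import numpy as np

def kmeans_dictionary(patches, K, n_iter=20, seed=0):
    """Learn K codes by Lloyd's algorithm.

    patches: (N, D) array of flattened patches.
    Returns (codes, objective), where objective is the sum over patches
    of the squared distance to the nearest code.
    """
    rng = np.random.default_rng(seed)
    # Initialise the codes with K distinct randomly chosen patches.
    codes = patches[rng.choice(len(patches), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance between every patch and every code: (N, K).
        d2 = ((patches[:, None, :] - codes[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(K):
            members = patches[assign == j]
            if len(members) > 0:  # keep the old code if a cluster empties
                codes[j] = members.mean(axis=0)
    d2 = ((patches[:, None, :] - codes[None, :, :]) ** 2).sum(axis=2)
    return codes, d2.min(axis=1).sum()

# Hypothetical toy data standing in for whitened image patches.
toy_patches = np.random.default_rng(1).normal(size=(200, 4))
codes, objective = kmeans_dictionary(toy_patches, K=5)
```

The dense distance matrix keeps the sketch short; for realistic patch counts one would compute distances in batches or shard the data, as the text later suggests.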

(2) Pooling. Since the coding results are highly over-complete
and highly redundant, the pooling layer aggregates
the activations over a spatial region of the image to
obtain a K dimensional vector x. Each dimension of the 
pooled feature is obtained by taking the activations of the 
corresponding code in the given spatial region (also called 
receptive field in the literature), and performing a prede- 
fined operator (usually average or max) on the set of activa- 
tions. 

Figure 1 shows an example when average pooling is car- 
ried out over the whole image. In practice we may define 
multiple spatial regions per image (such as a regular grid), 
and the global representation for the image will then be a 
vector of size K times the number of spatial regions. 
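The pooling step can be sketched as follows, assuming the encoded activations of one image are arranged as an (H, W, K) array with one K-dimensional activation vector per patch location. The function name `spatial_pool` and the grid handling are illustrative choices, not taken from the original implementation.

```python
import numpy as np

def spatial_pool(activation_map, grid=(2, 2), op="average"):
    """Pool an (H, W, K) activation map over a regular grid of regions.

    Returns a vector of length grid[0] * grid[1] * K, i.e. K times the
    number of spatial regions.
    """
    H, W, K = activation_map.shape
    gh, gw = grid
    # Region boundaries; this also covers H or W not divisible by the grid.
    rows = np.linspace(0, H, gh + 1).astype(int)
    cols = np.linspace(0, W, gw + 1).astype(int)
    reducer = np.mean if op == "average" else np.max
    pooled = []
    for r in range(gh):
        for c in range(gw):
            region = activation_map[rows[r]:rows[r + 1], cols[c]:cols[c + 1]]
            pooled.append(reducer(region, axis=(0, 1)))
    return np.concatenate(pooled)

# Hypothetical 4 x 4 map of K = 3 activations per location.
toy_map = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
quadrant_features = spatial_pool(toy_map, grid=(2, 2), op="average")
```

With `grid=(1, 1)` the function reduces to pooling over the whole image, which is the case illustrated in Figure 1.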

Since the feature extraction for image classification often
involves the spatial pooling stage, patch-level dictionary
learning may not find good dictionaries that produce the
most informative pooled outputs. In fact, one would reasonably
argue that it does not, one immediate reason being that
patch-based dictionary learning algorithms often yield similar
Gabor filters with small translations. Such filters, when
pooled over a certain spatial region, produce highly correlated
responses and lead to redundancy in the feature representation.
Figure 2 shows two such examples, where filters
produce uncorrelated patch-based responses but highly correlated
pooled responses.

Figure 2: Two codes learned from a patch-based K-means algorithm that produce weakly correlated patch-based responses, but
highly correlated responses after pooling. Notice that this phenomenon exists not only between codes with a translational
difference (as in (a)), but also between codes that differ in other aspects of appearance (such as color in (b)).

Convolutional approaches [12, 22] are usually able to
find dictionaries that are more spatially invariant than patch- 
based K-means, but learning may not scale as well as sim- 
ple clustering algorithms, especially with hundreds or thou- 
sands of codes. In addition, convolutional approaches may 
still not solve the problem of inter-code invariance: for ex- 
ample, the response of a colored edge filter might have high 
correlation with that of a gray scale edge filter, and such 
correlation could not be modeled by spatial invariance. 

3. Pooling-Invariant Dictionary Learning 

We are interested in designing a simple yet effective dic- 
tionary learning algorithm that takes into consideration the 
pooling stage of the feature extraction pipeline, and that 
models the general invariances among the pooled features. 
Observing the effectiveness of clustering methods in dictionary
learning, we propose to learn a final dictionary of size
K in two stages: first, we adopt the patch-based K-means
algorithm to learn a more over-complete starting dictionary
of size M (M > K); we then perform encoding and pooling
using this starting dictionary, and learn the final, smaller dictionary
of size K from the statistics of the M pooled features.

The motivation for this idea is that K-means is a highly
parallelizable algorithm that could be scaled up by simply 
sharding the data, allowing us to have an efficient algorithm 
for dictionary learning. Using a starting dictionary allows 
us to preserve most information on the patch-level, and the 
second step prunes away the redundancy due to pooling. 



Note that the large dictionary is only used during the feature 
learning time - after this, for each input image, we only need 
to encode local patches with the selected, relatively smaller 
dictionary, not any more expensive than existing feature ex- 
traction methods. 

3.1. Feature Selection with Affinity Propagation 

The first step of our algorithm is identical to the patch- 
based K-means algorithm with a starting dictionary size M. 
After this, we can sample a set of image super-patches of
the same size as the pooling regions, and obtain the M-dimensional
pooled features from them. Randomly sampling a
large number of pooled features in this way allows us to
analyze the pairwise similarities between the codes in the
starting dictionary in a post-pooling fashion. We would then
like to find a K-dimensional subspace that best represents
the M pooled features. Specifically, we define the similarity between
two pooled dimensions i and j (which correspond to two codes in
the starting dictionary) as



(1) 



where C is the covariance matrix computed from the ran- 
dom sample of pooled features. We note that this is equiv- 
alent to the negative Euclidean distance between the coded 
output i and the coded output j when the outputs are nor- 
malized to have zero mean and standard deviation 1. We 
then use affinity propagation [ ], which is a version of the 
K-centroids algorithm, to select centroids from existing fea- 
tures. Intuitively, codes that produce redundant pooled out- 
put (such as translated versions of the same code) would 
have high similarity between them, and only one exemplar 
would be chosen by the algorithm. 
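The sketch below does not reproduce Eq. (1) itself. It computes the quantity the text describes as equivalent: the negative squared Euclidean distance between two pooled outputs after each has been standardized to zero mean and unit standard deviation, averaged over samples. That quantity equals 2ρ − 2, where ρ is the correlation derived from the covariance matrix C. The function name and the averaging convention are assumptions.

```python
import numpy as np

def pooled_similarity(pooled):
    """Similarity between pooled dimensions.

    pooled: (n_samples, M) array of pooled features, one column per code
            in the starting dictionary (each column must have non-zero
            variance).
    Returns an (M, M) matrix whose (i, j) entry is the negative mean
    squared distance between standardized outputs i and j.
    """
    z = (pooled - pooled.mean(axis=0)) / pooled.std(axis=0)
    corr = z.T @ z / z.shape[0]  # correlation matrix derived from C
    return 2.0 * corr - 2.0

# Hypothetical pooled features: column 1 nearly duplicates column 0.
rng = np.random.default_rng(0)
toy_pooled = rng.normal(size=(500, 4))
toy_pooled[:, 1] = toy_pooled[:, 0] + 0.1 * rng.normal(size=500)
similarity = pooled_similarity(toy_pooled)
```

Redundant codes, such as columns 0 and 1 in the toy data, end up with a similarity close to 0, while unrelated codes sit near −2.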

Specifically, affinity propagation finds centroids from a
set of candidates where the pairwise similarity $s(i, j)$ ($1 \le i, j \le M$)
can be computed. It iteratively updates two
terms, the "responsibility" $r(i, k)$ and the "availability"
$a(i, k)$, via a message passing method following the rules
[9]:

$$r(i, k) \leftarrow s(i, k) - \max_{k' \neq k} \{a(i, k') + s(i, k')\} \tag{2}$$

$$a(i, k) \leftarrow \min\Big\{0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0, r(i', k)\}\Big\} \quad (\text{if } i \neq k) \tag{3}$$

$$a(k, k) \leftarrow \sum_{i' \neq k} \max\{0, r(i', k)\} \tag{4}$$

Upon convergence, the centroid that represents any candidate
$i$ is given by $\arg\max_k (a(i, k) + r(i, k))$, and the set of
centroids $S$ is obtained by

$$S = \{k \mid \exists\, i \text{ s.t. } k = \arg\max_{k'} (a(i, k') + r(i, k'))\} \tag{5}$$

We refer to [ ] for details about the nature of such message
passing algorithms.
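The message passing updates above can be sketched in vectorized NumPy as follows. Two ingredients are assumptions not spelled out in the text: the diagonal "preference" s(k, k), which influences how many centroids emerge (set here to the median off-diagonal similarity), and damping of the messages, which is commonly used to avoid oscillation.

```python
import numpy as np

def affinity_propagation(S, preference=None, damping=0.5, n_iter=200):
    """Select exemplars from a similarity matrix S of shape (n, n).

    Returns (exemplars, assign): the sorted exemplar indices and, for
    each candidate i, the index of the exemplar representing it.
    """
    S = np.array(S, dtype=float)
    n = S.shape[0]
    if preference is None:
        preference = np.median(S[~np.eye(n, dtype=bool)])
    S[np.diag_indices(n)] = preference
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    rows = np.arange(n)
    for _ in range(n_iter):
        # Responsibility: r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        top = AS.argmax(axis=1)
        first = AS[rows, top]
        AS[rows, top] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, top] = S[rows, top] - second
        R = damping * R + (1 - damping) * R_new
        # Availability: column sums of positive responsibilities, keeping
        # r(k,k) itself and removing each row's own contribution.
        Rp = np.maximum(R, 0)
        Rp[np.diag_indices(n)] = R[np.diag_indices(n)]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = A_new[np.diag_indices(n)].copy()  # a(k,k) is not clipped
        A_new = np.minimum(A_new, 0)
        A_new[np.diag_indices(n)] = self_avail
        A = damping * A + (1 - damping) * A_new
    assign = (A + R).argmax(axis=1)
    return np.unique(assign), assign

# Hypothetical toy problem: two groups of points on a line.
x = np.array([0.0, 1.0, 2.0, 6.0, 7.0, 8.0])
toy_S = -(x[:, None] - x[None, :]) ** 2
exemplars, assign = affinity_propagation(toy_S)
```

In the setting of this paper, `S` would be the M × M similarity matrix between pooled dimensions, and the preference would be tuned until the desired number K of centroids is obtained; that tuning step is an assumption about usage rather than something stated in the text.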

3.2. Visualization of Selected Filters 

To visually show what codes are selected by affinity 
propagation, we applied our approach to the CIFAR-10 
dataset by first training an over-complete dictionary of 3200 
codes, and then performing affinity propagation on the 
3200-dimensional pooled features to obtain 256 centroids, 
which we visualize in Figure 3. Translational invariance
appears to be the most dominant factor, as many clusters
contain translated versions of the same Gabor-like code,
especially for grayscale codes. On the other hand, clusters
capture more than translation: clusters such as column 5
focus on finding the contrasting colors more than finding
edges of exactly the same angle, and clusters such as the last
column find invariant edges of varied color. We note that
the selected codes are not necessarily centered (which is the 
case for convolutional approaches), as the centroids are se- 
lected solely from the pooled response covariance statistics, 
which does not explicitly favor centered patches. 

4. Why Does Clustering Work? 

We will briefly discuss the reason why simple clustering 
algorithms work well in finding a good dictionary. Essen- 
tially, given a dictionary D of size K and specifications on 
the encoding and pooling operations, the feature extraction 
pipeline could be viewed as a projection from the original 
raw image pixels to the pooled feature space $\mathbb{R}^K$. Note that
this space does not degenerate since the feature extraction is 
highly nonlinear, and one could also view this from a kernel 
perspective as defining a specific kernel between images. 
Then, one way to evaluate the information contained in this 
embedded feature space is to use the covariance matrix C of 
the output features, which plays the dual role of the kernel 
matrix between the encoded patches. 

Larger dictionaries almost always increase performance [ ],
and in the limit one could use all
the patches $\mathcal{P}$ as the dictionary, leading to an N-dimensional
space and an $N \times N$ covariance matrix $C_{\mathcal{P}}$. Since in practice
we always assume a budget on the dictionary size, the
goal is to find a dictionary $\mathcal{D}$ of size K, which yields a K-dimensional
space and a covariance matrix $C_{\mathcal{D}}$. The approximation
to the "oracle" encoded space using $\mathcal{D}$ could
then be computed as:

$$C_{\mathcal{P}} \approx C_{\mathcal{P}\mathcal{D}} C_{\mathcal{D}}^{+} C_{\mathcal{D}\mathcal{P}} \tag{6}$$

where $C_{\mathcal{P}\mathcal{D}}$ is the covariance matrix between the features
extracted by $\mathcal{P}$ and those extracted by $\mathcal{D}$, and $C^{+}$ denotes
the matrix pseudo-inverse.

Note that this explanation works for both the before-pooling
case and the after-pooling case. The dictionary
learning algorithm can then be thought of as finding a dictionary
that approximates $C_{\mathcal{P}}$ best. Interestingly, this could
be thought of as a form of the Nyström method, which subsamples
subsets of the matrix columns for approximation. The
Nyström method has been used to approximate large matrices
for spectral clustering [ ], and here enables us to explain
the mechanism of dictionary learning. Recent research in
the machine learning field, notably [ ], supports the recent
empirical observations in vision: first, it is known that
uniformly sampling the columns of the matrix $C_{\mathcal{P}}$ already
works well in reconstruction, which explains the good performance
of random patches in feature learning [ ]; second,
theoretical results [ , ] have shown that clustering
algorithms work particularly well compared with other methods as
a data-driven way of finding good subsets to approximate
the original matrix, justifying the use of clustering in
dictionary learning.

In the patch-based dictionary learning, clustering could 
be directly applied on the set of patches (i.e. the largest dic- 
tionary in the limit). When we consider pooling, though, 
using all the patches becomes non-trivial, since we need to 
compute pooled features using each patch as a code, which 
is computationally overwhelming. Thus, a reasonable ap- 
proach to consider is to first find a subset of all the patches 
using patch-based clustering (although this "subset" is still 
larger than our final dictionary size), and perform clustering 
on the pooled outputs of this subset only, which leads to the 
proposed algorithm. The performance of the algorithm is 
then supported by the Nystrom theory above. 
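To make the matrix approximation view concrete, the following sketch applies the Nyström formula to a covariance-like matrix using a subset of its columns. It assumes the selected dictionary is a subset of the candidate codes, so the cross-covariance is simply a block of columns; the function name and the synthetic rank-3 matrix are illustrative.

```python
import numpy as np

def nystrom_approximation(C, subset):
    """Approximate a symmetric PSD matrix C from the columns in `subset`:
    C ~ C[:, S] @ pinv(C[S, S]) @ C[S, :].
    """
    C_cols = C[:, subset]              # cross-covariance with the subset
    C_sub = C[np.ix_(subset, subset)]  # covariance within the subset
    return C_cols @ np.linalg.pinv(C_sub) @ C_cols.T

# Synthetic example: a rank-3 PSD matrix over 8 candidate codes.
rng = np.random.default_rng(0)
B = np.vstack([np.eye(3), rng.normal(size=(5, 3))])
C_full = B @ B.T

approx_3 = nystrom_approximation(C_full, [0, 1, 2])  # spans the full rank
approx_2 = nystrom_approximation(C_full, [0, 1])     # loses one direction
err_3 = np.linalg.norm(C_full - approx_3)
err_2 = np.linalg.norm(C_full - approx_2)
```

When the chosen columns span the range of the matrix the reconstruction is exact, and dropping a needed column leaves a residual. This is the sense in which a well-chosen subset of codes can stand in for a much larger dictionary.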

Based on the matrix approximation explanation, we further
reshape our selected features to match the distribution
in the higher-dimensional space corresponding to the starting
dictionary. After selecting K centroids denoted by subset
$S$, we can group the original $M \times M$ covariance matrix
as

$$C = \begin{bmatrix} C_{SS} & C_{S\bar{S}} \\ C_{\bar{S}S} & C_{\bar{S}\bar{S}} \end{bmatrix}, \quad W = \begin{bmatrix} C_{SS} \\ C_{\bar{S}S} \end{bmatrix} \tag{7}$$

where $C_{S\bar{S}}$ denotes the covariance between the subset $S$
and the subset $\bar{S}$, and so on. We then approximate the original
high-dimensional covariance matrix as

$$C \approx W C_{SS}^{+} W^\top \tag{8}$$

Figure 3: Visualization of the learned codes. Left: the selected subset of 256 centroids from an original set of 3200 codes.
Right: The similarity between each centroid and the other codes in its cluster. For each column, the first code is the selected
centroid, and the remaining codes are in the same cluster represented by it. Notice that while translational invariance is the
most dominant factor, our algorithm does find invariances beyond that (e.g., notice the different colors on the last column).
Best viewed in color.

More importantly, given the pooled outputs $\mathbf{x}_S$ using the
selected filters, the high-dimensional feature could be approximated
by

$$\mathbf{x} \approx A \mathbf{x}_S, \quad \text{where } A = W C_{SS}^{+} \tag{9}$$

Notice that the implicit dimensionality of the data is still no
larger than the number K of selected codes, so we can apply
SVD on the matrix $A$ as $A = U \Lambda V$, where $\Lambda$ is a $K \times K$
diagonal matrix and $U$ is an $M \times K$ column-wise orthonormal
matrix, and compute the low-dimensional feature (with
a little abuse of terminology) as

$$\tilde{\mathbf{x}}_S = \Lambda V \mathbf{x}_S \tag{10}$$

The $K \times K$ transform matrix $\Lambda V$ could be pre-computed
during feature selection, and imposes minimal overhead
during the actual feature extraction. This transform does
not change the dimensionality or the rank of the data, and
only changes the shape of the underlying data distribution.
In practice, we found that the transformed data yields slightly
better performance than the untransformed data when combined
with a linear SVM with L2 regularization.
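The transform described above can be sketched as follows. The sketch takes the columns of C that belong to the selected codes as W (the same matrix up to a reordering of rows), forms A from the pseudo-inverse of the selected block, and obtains the K × K transform from a thin SVD. The function name and the synthetic covariance are assumptions.

```python
import numpy as np

def selection_transform(C, selected):
    """Return (A, T) for a covariance matrix C and selected code indices.

    A (M x K) maps the selected pooled features to an approximation of
    all M pooled features; T (K x K) is the transform applied to the
    selected features, i.e. the product of the singular values and the
    right singular vectors of A.
    """
    W = C[:, selected]                     # M x K columns of C
    C_SS = C[np.ix_(selected, selected)]   # K x K selected block
    A = W @ np.linalg.pinv(C_SS)
    # Thin SVD: A = U @ diag(lam) @ V, with U column-wise orthonormal.
    U, lam, V = np.linalg.svd(A, full_matrices=False)
    T = np.diag(lam) @ V
    return A, T

# Hypothetical covariance of M = 6 pooled features, with K = 3 selected.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
C_toy = np.cov(features, rowvar=False)
selected = [0, 2, 5]
A, T = selection_transform(C_toy, selected)
x_S = rng.normal(size=3)
x_transformed = T @ x_S
```

Since U has orthonormal columns, `T @ x_S` has the same norm as `A @ x_S`: the transformed K-dimensional feature preserves the geometry of the approximated M-dimensional feature while keeping the per-image cost at one small matrix product.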

If one simply wanted to find a low-dimensional representation
from the M-dimensional pooled features, one
would naturally choose PCA to find the K most significant
projections:

$$C \approx U_K \Lambda_K U_K^\top \tag{11}$$

where the $M \times K$ matrix $U_K$ contains the eigenvectors and
the $K \times K$ diagonal matrix $\Lambda_K$ contains the eigenvalues.
The low-dimensional features are then computed as $\mathbf{x}_K = U_K^\top \mathbf{x}$.
We note that while this guarantees the best K-dimensional
approximation, it does not help in our task
since the number of filters is not reduced, as PCA almost
always yields non-zero coefficients for all the dimensions.
Linearly combining the codes does not work either, due to
the nonlinear nature of the encoding algorithm. However,
as we will show in Section 5.3, results with PCA show that
a larger starting dictionary almost always helps performance
even when the feature is then reduced to a lower-dimensional
space of the same size as a smaller dictionary, which
justifies the use of matrix approximation to explain the
dictionary learning behavior.

5. Experiments 

We apply our pooling-invariant dictionary learning 
(PDL) algorithm on several benchmark tasks, including the 
CIFAR-10 and STL datasets on which performance can be 
systematically analyzed, and the fine-grained classification 
task of classifying bird species, on which we show that fea- 
ture learning provides a significant performance gain com- 
pared to conventional methods. 

5.1. CIFAR-10 and STL 

The CIFAR-10² and the STL³ datasets are extensively
used to analyze the behavior of feature extraction pipelines.
CIFAR-10 contains a large amount of training and testing
data, while STL contains a very small amount of training
data and a large number of unlabeled images. As our algorithm
works with any encoding and pooling operations, we
adopted the standard setting usually used on these datasets:
extracting local 6 × 6 patches with the mean subtracted and contrast
normalized, whitening the patches with ZCA, and then
training the dictionary with normalized K-means. The features
are then encoded using one-sided threshold encoding with
$\alpha = 0.25$ and average pooled over the four quadrants (2 × 2)
of the image. For STL, we followed [ ] and resized the images to
32 × 32. For the PDL algorithm, instead of learning a different
set of codes for each pooling quadrant, we learn an
²http://www.cs.toronto.edu/~kriz/cifar.html
³http://www.stanford.edu/~acoates/stl10/, [3]




Figure 4: (a)-(c): The filter responses before and after pooling: (a) before pooling, between codes in the same cluster 
(correlation p = 0.282), (b) after pooling, between codes in the same cluster (p = 0.756), and (c) after pooling, between the 
selected centroids (p = 0.165), (d): the approximation of the eigenvalues using the Nystrom method. 



identical set of codes in general for pooling the coded out- 
puts. For all experiments, we carry out 5 independent runs 
and take the mean accuracy to report here. 
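The patch pre-processing mentioned in this setting (per-patch mean subtraction, contrast normalization, and ZCA whitening) can be sketched as below. The regularization constants `eps_norm` and `eps_zca` and the function name are assumptions chosen for illustration; the paper does not state the values it used.

```python
import numpy as np

def preprocess_patches(patches, eps_norm=10.0, eps_zca=0.1):
    """Normalize and ZCA-whiten flattened patches of shape (N, D).

    Returns (whitened, mean, zca) so the same transform can later be
    applied to new patches as (p_normalized - mean) @ zca.
    """
    # Per-patch brightness (mean) and contrast (variance) normalization.
    patches = patches - patches.mean(axis=1, keepdims=True)
    patches = patches / np.sqrt(patches.var(axis=1, keepdims=True) + eps_norm)
    # ZCA whitening: rotate to the eigenbasis of the covariance, rescale
    # each direction, and rotate back so the result stays in pixel space.
    mean = patches.mean(axis=0)
    centered = patches - mean
    cov = centered.T @ centered / len(centered)
    eigval, eigvec = np.linalg.eigh(cov)
    zca = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps_zca)) @ eigvec.T
    return centered @ zca, mean, zca

# Hypothetical data standing in for 3 x 3 grayscale patches.
raw = np.random.default_rng(0).normal(size=(2000, 9))
whitened, patch_mean, zca = preprocess_patches(raw)
```

The whitened patches would then be passed to the dictionary learning step. The ZCA matrix is symmetric, which keeps the whitened patches interpretable as images.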

5.2. Statistics for Feature Selection 

We first verify whether the learned codes capture the
pooling invariance as we claimed in the previous section.
To this end, we start from the selected features in Section 3.2,
and randomly sample three types of filter responses:
(a) pairwise filter responses before pooling between codes
in the same cluster, (b) pairwise filter responses after pooling
between codes in the same cluster, and (c) pairwise filter
responses after pooling between the selected centroids.
The distributions of these responses are plotted in Figure 4.
The result verifies our conjecture well: first, codes that produce
uncorrelated responses before pooling may become
correlated after the pooling stage (compare 4(a) and 4(b)),
which can be effectively identified by the affinity propagation
algorithm; second, by explicitly taking into consideration
the pooling behavior, we are able to select a subset of
the features whose responses are weakly correlated (compare
4(b) and 4(c)), which helps preserve more information with
a fixed number of codes.

Figure 4(d) shows the eigenvalues of the original covariance
matrix and those of the approximated covariance matrix,
using the same setting as in the previous subsection⁴.
The approximation captures the largest eigenvalues of the
original covariance matrix well, while dropping at a higher
rate for smaller eigenvalues.

5.3. Classification Performances 

Figure 5 shows the relative improvement obtained on 
CIFAR-10, when we use a budgeted dictionary of size 200, 
but perform feature selection from a larger dictionary as indicated
by the x-axis. The PCA performance is also included
in the figure as a loose upper bound of the feature

⁴Note that since we only select 256 features, the number of nonzero
eigenvalues is 256 for the approximation. 
Figure 5: Performance improvement on CIFAR when using
different starting dictionary sizes and a final dictionary of 
size 200. Shaded areas denote the standard deviation over 
different runs. Note that the x-axis is in log scale. 

selection performance. Learning the dictionary with our
feature selection method consistently increases the performance
as the size of the original dictionary increases, and is
able to obtain about 70% of the performance gain obtained by
PCA (again, notice that PCA still requires all the codes to
be used, and thus does not save feature extraction time).

The detailed performance of our algorithm on the two 
datasets, using different starting and final dictionary sizes, 
is visualized in Figure 6. Table 1 summarizes the accu- 
racy values of two particular cases - final dictionary sizes 
of 200 and 1600 respectively. Note that our goal is not to 
get the best overall performance - as performance always 
goes up when we use more codes. Rather, we focus on how 
much gain the pooling-aware dictionary learning gets, given 
a fixed dictionary size as the budget. Figure 7 shows the 
performance of the various settings, using PCA to reduce 
the dimensionality instead of PDL. As we stated in the pre- 
vious section, this serves as an upper bound of the feature 
selection algorithms. 

Overall, considering the pooled feature statistics always
helps us to find better dictionaries, especially when only













[Figure plots omitted in this text version; legend entries: K-means, 2x PDL, 4x PDL, 8x PDL]











Table 1: Classification Accuracy on the CIFAR-10 and STL 
datasets under different budgets. 
Figure 6: Accuracy values on the CIFAR-10 (top) and STL
(bottom) datasets under different final dictionary sizes. "nx
PDL" means learning the dictionary from a starting dictionary
that is n times larger.

Figure 7: Accuracy values using PCA to reduce the dimensionality
to the same as that PDL uses, as an accuracy upper
bound for the performance of PDL.



a small dictionary is allowed for classification. Choosing
from a larger starting dictionary helps increase performance,
although this effect saturates when the starting dictionary
size is much larger than the target size. For the STL dataset,
a large starting dictionary may lessen the performance gain
(Figure 6(b)). We infer the reason to be that feature selection
is more prone to local optima, and the small amount of training
data in STL may cause the performance to be sensitive
to suboptimal codebook choices. However, in general
the codebook learned by PDL is consistently better than its
patch-based counterpart.

Finally, we note that due to the heavy-tailed nature of the 
encoded and pooled features (see the eigendecomposition 



Task                     Learning Method   Accuracy (%)
CIFAR-10, 200 codes      K-means           69.02
                         2x PDL            70.54 (+1.52)
                         4x PDL            71.18 (+2.16)
                         8x PDL            71.49 (+2.47)
CIFAR-10, 1600 codes     K-means           77.97
                         2x PDL            78.71 (+0.74)
STL, 200 codes           K-means           53.22
                         2x PDL            54.57 (+1.35)
                         4x PDL            55.53 (+2.31)
                         8x PDL            55.52 (+2.30)
STL, 1600 codes          K-means           58.16
                         2x PDL            58.28 (+0.12)



of Figure 4), one can infer that the representations obtained 
with a budget would have a correspondingly bounded per- 
formance when combined with linear SVMs. In this paper 
we have focused on analyzing unsupervised approaches. In- 
corporating weakly supervised information to guide feature 
learning / selection or learning multiple layers of feature 
extraction would be particularly interesting, and would be a 
possible future direction. 

5.4. Fine-grained Classification 

To show the performance of the feature learning algorithm
in real-world image classification tasks, we tested
the performance of our algorithm on the fine-grained classification
task, using the 2011 Caltech-UCSD Birds dataset
[ ]⁵. Classification of fine-grained categories poses a significant
challenge for contemporary vision algorithms,
as such classification tasks usually require the identification
of localized appearances like "birds with yellow spotted
feathers on the belly", which are hard to capture using manually
designed features. Recent work on fine-grained classification
usually focuses on the localization of parts [ , , ],
and still uses manually designed features. Yao et al. [21]
proposed to use a template-based approach for more powerful
features, but the approach may be difficult to scale as
the number of templates grows larger⁶. In addition, due to
the lack of training data in fine-grained classification tasks,
whether supervised feature learning is useful or not is
not yet clear.

We performed classification on all the 200 bird species
provided. For the image pre-processing we followed the
same setting as [24, 21] by cropping the images to be centered
on the birds using 1.5× the size of the provided bounding
boxes. We then resized each cropped image to 128 × 128
to avoid any artifact that may be introduced by the varying
number of local features. The training data are expanded
by simply mirroring each training image. Then, we extracted
5 × 5 local whitened patches as we did for CIFAR-10, encoded
them using threshold encoding with a dictionary of
size 1600, and performed 4 × 4 max pooling since the bird
images are larger than those from CIFAR and STL. The dictionary
is learned using our feature extraction pipeline, from
an original set of 3200 patch-based clustering centers.

⁵http://www.vision.caltech.edu/visipedia/CUB-200-2011.html

⁶Due to computational complexity, some early work such as [ , ] does
not scale up well, and only reported performance on subsets of the whole
data [personal communication].

Table 2: Classification Accuracy on the 2011 Caltech-UCSD Birds dataset.

Method                               Accuracy (%)
BoW SIFT Baseline [ ]                18.60
Pose Pooling + linear SVM [2 ]       24.21
Pose Pooling + SVM [ ]               28.18
K-means + linear SVM (as in [ ])     38.17
PDL + linear SVM                     38.91

Our classification results, together with previous state- 
of-the-art baselines from [24], are reported in Table 2. It 
is somewhat surprising to observe that feature learning pro- 
vides a significant performance boost in this case, indicating 
that in addition to part localization (which has been the fo- 
cus of fine-grained classification), learning appropriate fea- 
tures / descriptors to represent local appearances may be 
a major factor in fine-grained classification, possibly due 
to the subtle appearance changes for such tasks. As we 
have shown here, even simple and fully unsupervised fea- 
ture learning algorithms such as K-means and PDL could 
lead to significant accuracy improvement, and we hope this 
would inspire further advancement in the fine-grained clas- 
sification research. 



6. Conclusion 

We have proposed a novel algorithm to efficiently take
into account the invariance of learned features after the spatial
pooling stage. The algorithm is empirically shown to
identify redundancy between codes learned in a patch-based
way, and yields dictionaries that produce better classification
accuracy than simple patch-based approaches. To explain
the performance gain we proposed to take a matrix
approximation view of dictionary learning, and showed
the close connection between the proposed method and the
Nyström method. The proposed method does not introduce
overhead during classification time, and could be easily
"plugged in" to existing image classification pipelines.